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Abstract 

Background: Genome-wide disease-gene finding approaches may sometimes provide us with a long list of 
candidate genes. Since using pure experimental approaches to verify all candidates could be expensive, a number 
of network-based methods have been developed to prioritize candidates. Such tools usually have a set of 
parameters pre-trained using available network data. This means that re-training network-based tools may be 
required when existing biological networks are updated or when networks from different sources are to be tried. 

Results: We developed a parameter-free method, interconnectedness (ICN), to rank candidate genes by assessing 
the closeness of them to known disease genes in a network. ICN was tested using 1,993 known disease-gene 
associations and achieved a success rate of -44% using a protein-protein interaction network under a test scenario 
of simulated linkage analysis. This performance is comparable with those of other well-known methods and ICN 
outperforms other methods when a candidate disease gene is not directly linked to known disease genes in a 
network. Interestingly, we show that a combined scoring strategy could enable ICN to achieve an even better 
performance (-50%) than other methods used alone. 

Conclusions: ICN, a user-friendly method, can well complement other network-based methods in the context of 
prioritizing candidate disease genes. 



Background 

The wide applications of high-throughput techniques 
have enabled researchers to investigate disease mechan- 
isms in a genome-wide scale [1,2]. However, one chal- 
lenge is that these techniques are usually unable to 
precisely pinpoint the causative genes. For example, a 
linkage analysis may give a disease-linked chromosomal 
region, which may harbor hundreds of candidate genes 
[3,4]; an association study may identify a number of 
false positives if the disease under investigation has a 
complex inheritance pattern [5]. While a whole genome 
re-sequencing can find a number of genetic variations in 
a patient, only a few of them may play a role in the dis- 
ease etiology [1]. Therefore, time-consuming and 
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laborious experiments are usually required to determine 
the real disease genes from a large number of candidates 
given by high-throughput experiments. One strategy to 
accelerate the whole disease gene finding process is to 
use a computational approach to prioritize candidate 
genes. 

Many computational approaches for prioritizing candi- 
date genes have been developed, assuming that one dis- 
ease could be caused by a group of functionally related 
genes. Such approaches measure the functional similar- 
ity of each candidate gene to known disease genes using 
experimentally verified biological data (for details see 
review [6-9] and Additional File 1). Among these 
approaches, network-based ones have shown a good 
performance. The working hypothesis of network-based 
methods is that genes causing one disease are likely to 
locate closely to each other in a biological network 
[6,10]. Some network-based methods prioritize 
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candidate genes based on whether they directly interact 
with known disease genes [11,12]; other methods further 
consider the shortest-path distance between candidate 
genes and known disease genes in a network when 
direct links do not exist [13,14]. On the other hand, dif- 
ferent methods might employ distinct scoring strategies. 
Lage et al. [15] developed a Bayesian predictor that 
could combine interactome and phenome to infer puta- 
tive protein complexes likely to associate with a disease. 
The CIPHER method scores the candidate genes using a 
regression model of phenotype similarity and gene clo- 
seness in a network [16]. Other network-based algo- 
rithms, such as random walk [17], network flow [18], 
page rank [19], network partition [20], and network 
clustering [21], were also designed to prioritize candi- 
date disease genes. 

Network-based methods usually have some para- 
meters that need to be trained using available data 
sets. The random walk method needs a parameter to 
control the probability of returning to the initial node 
[17], and the network flow algorithm uses a parameter 
to describe the relative importance of prior informa- 
tion [18]. Lage's method requires determining several 
parameters in order to build the predictor [15]. When- 
ever biological networks are updated or new training 
data become available, their parameters should be re- 
tuned in order to optimize their performance. It may 
be difficult for biologists to rep eat these processes by 
themselves. Additionally, a parameter set may just 
work for certain cases. Here, we take the random walk 
(RW) method as an example. Although a parameter 
setting (r = 0.5) of RW appears to suffice the identifi- 
cation of many disease genes, using other parameters 
may be required to find certain disease genes (Figure 
1). How to intelligently choose the parameters could 
be a difficult task to users. We argue that a parameter- 
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Figure 1 Associations between parameter values and disease- 
gene association predictions. There are 220 disease-gene 
association cases in this example. The parameter r is used in 
random walk method to control the probability of returning to 
initial node [17]. The solid blocks indicate this method with a 
specific parameter value successfully gives the true disease genes 
the highest ranking (for details see the Materials and methods). 



free algorithm could be more useful to users in this 
regard. 

In this study, we propose a new candidate gene priori- 
tization approach that measures the interconnectedness 
(ICN) between genes in a network. It was designed to 
be a parameter-free method. Unlike other network- 
based methods, ICN measures closeness of each candi- 
date genes to known disease genes by taking alternative 
paths into consideration, in addition to the direct link 
and the shortest-path distance. In comparison with 
other outperforming network-based methods, ICN is a 
competitive method. In particular, we show that an 
impressive performance of prioritizing candidate disease 
genes could be achieved by combining ICN with other 
network-based methods. Finally, a novel type of spino- 
cerebellar ataxia (SCA) was chosen to demonstrate the 
ability of this method. 

Results and discussion 

Principles of the interconnectedness-based method 

Most network-based gene prioritization methods, 
including this one we have developed here, were created 
on the basis that causative genes of one disease may 
tend to locate closely in the network [6,10]. The 
approaches taken by various methods differ on how clo- 
seness between genes is measured. Before this method is 
developed, other network-based methods prioritize can- 
didate genes by finding direct-linked disease genes or 
close disease genes using shortest-path distance. One 
concern with these previous methods is that they might 
be less effective than expected if there are noises or 
missing direct links in the network used to measure 
inter-gene closeness. Consequently, we designed the 
InterConnectedNess-based method, ICN, to measure the 
closeness between genes by considering alternative 
paths, in addition to the shortest one, that could con- 
nect candidate genes to known disease genes. Briefly, 
ICN determines that these genes are more likely to 
belong to the same functional module if two genes have 
more shared interacting genes. A functional module 
may correspond to a protein complex [15,18] or to a 
signalling pathway [22], If a functional module is impli- 
cated with a disease, changes to a member gene in this 
module may cause this disease [23,24]. We applied ICN 
to the problem of prioritizing disease candidate genes. 

Comparison with other network-based prioritization 
algorithms 

According to the comprehensive comparison performed 
in [25], the best two outperforming methods for priori- 
tizing candidate genes were the Random Walk method 
(RW) [17] and the PRINCE (PRIoritizatioN and Com- 
plex Elucidation) algorithm (PR) [18]. In this project, 
they were re-implemented in order to compare their 
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performance with that of ICN. Their parameters were 
optimized as described in [18] (for details see Materials 
and Methods). 

Two biological networks were recruited as the data 
sets to evaluate the performance of ICN and other two 
methods. These networks were chosen because each 
network has features distinctive from that of the other. 
We intended to examine if each method could perform 
in a consistent manner using different types of network 
data. The first one is a protein-protein interaction net- 
work (PIN) consisting of 140,382 interactions and 
12,164 genes. PIN consists of data retrieved from nine 
protein-protein interaction data sources [26-34]. The 
second one is a functional association network (FAN) 
consisting of 1,217,908 interactions and 16,648 genes 
downloaded from the STRING database [35]. These two 
networks share 11,776 common genes and 95,630 com- 
mon interactions. Two major differences between these 
data sets are the number of interactions and the types 
of edges. While PIN edges are un-weighted, FAN edges 
are annotated with weights indicating the confidence of 
functional linkage between each pair of connected genes 
[36]. ICN is able to incorporate edge weights in quanti- 
fying the closeness between genes in a network. The sta- 
tistics of available data in each network is summarized 
in Table 1. 

A leave-one-out procedure was employed to carry out 
the evaluation. The disease-gene associations were 
obtained from OMIM [37]. These genes were manually 
grouped in to different disease families as described in 
Materials and Methods. In each validation trial, the 
association of one test gene with a disease family was 
removed, and each method was tried to re-build this 
association. To mimic the situations we may encounter 
when using different high-throughput genome-wide 
techniques to find disease genes, we created two test 
scenarios, the simulated linkage analysis and the whole 
genome scan. In the simulated linkage analysis, each 
time a test disease gene together with 100 genes on its 
flanking regions was taken as the candidate set. In the 
whole genome scan, each time a test disease gene 



together with all human genes in the network, excluding 
other members from the corresponding disease family, 
was taken as the candidate set. If a test gene was ranked 
top k in a candidate set in a trial, this trial was regarded 
as a successful one. We further defined the "success 
rate" as the fraction of successful trials for a method 
under a particular test scenario. 

The results of simulated linkage analysis for each 
method are presented in Figure 2. 1,993 and 2,616 dis- 
ease-gene associations were tested using PIN and FAN, 
respectively. When PIN was used, ICN achieved the best 
performance with a success rate of 44.7%, ranking the 
known disease genes as top 1 candidate (k=l) in 870 
out of 1,993 cases. RW and PR also achieved the similar 
performance with a success rate of 43.3% (862/1993) 
and 43.4% (865/1993), respectively. When the rank cut- 
off (k) was increased, PR had the best performance, 
while the performance of ICN was still comparable with 
that of PR (Figure 2A). When FAN was used, RW 
achieved a success rate of 71.3% (1865/2616), better 
than that ICN (64.1%, 1678/2616) and PR (66.4%, 1738/ 
2616) did. On the other hand, as rank cutoff was 
increased (k >= 5), the performance of ICN and PR was 
better than that of RW (Figure 2B). 

The performance comparison under the test scenario 
of whole genome scan is shown in Figure 3. When PIN 
was used, ICN successfully ranked the known disease 
genes as top 1 candidate in 192 out of 1,993 cases, with 
a success rate of 0.096. RW performance with a success 
rate of 15.0% (299/1,993) was higher than ICN and PR 
(6.9%, 137/1,993). Similarly, the performance of ICN 
(10.4%, 272/2,616) was between RW (19.1%, 499/2,616) 
and PR (6.7%, 174/2,616) when FAN was used. The 
benchmark reveals that although ICN did not outper- 
form in all cases, it was quite comparable to other 
methods. 

If the cases with disease genes being ranked as top 1 
candidates by at least one of three prioritization meth- 
ods were considered as successful predictions, the over- 
all success rates so achieved were 54.3 % (1,083/1,993) 
by using PIN and 79.2% (2,073/2,616) by using FAN, 



Table 1 Statistics of biological networks 



Protein-protein interaction network (PIN) Functional association network 

(FAN) 



Data source(s) 


Integration from DIP, BOND, IntAct, MINT, MIPS, HPRD, BioGRID, Reactome, and pathway 
commons 


STRING v8.2 


Network type 


Unweighted 


Weighted 


# genes 


12,164 


16,648 


# interactions 


140,382 


1,217,908 


# disease families 


344 


509 


# disease-gene 


1,993 


2,616 


associations 






# disease genes 


1,640 


1,909 
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respectively, under the test scenario of simulated linkage 
analysis. The overall performance was much better than 
that of respective methods. Figure 4 presents the over- 
laps of successful predictions among ICN, RW, and PR. 
No matter which biological network was used, RW and 
PR shared more success cases than other combinations. 
This is not really surprising, since RW and PR took a 
similar iterative procedure to look for candidate genes 
in a network [17,18]. Interestingly, each method pre- 
dicted unique cases. In particular, ICN gave the highest 
number of unique success cases using PIN, and it gave a 
comparable number of unique cases with that of RW 
using FAN. These results indicate that each method 
may perform better than other methods on certain 
cases. Analyzing the difference of the unique success 
cases generated by different methods may help us get a 
deeper understanding of unique advantage of each 
method, which could assist us to further improve the 
performance. 

Exploring the cases uniquely predicted by respective 
methods 

Intuitively, topological properties of genes in a network 
may affect the performance of candidate gene prioritiza- 
tion when network-based methods are used. To under- 
stand how the performance of different methods could 
be influenced, we examined if the disease genes uniquely 
identified by individual methods had distinctive topolo- 
gical properties. For simplicity, disease genes uniquely 



identified by ICN are denoted as ICN-unique genes/ 
cases, and so forth for other methods, in the following 
text. 

Firstly, the number of interacting partners, also 
referred to as the degree in the graph theory [38], of 
each method-unique case was considered. We noticed 
that when PIN was used, the average degree of RW- 
unique cases was significantly higher than these of ICN- 
and PR-unique cases (P-value = 0.002 and 2.9x10-6, 
Wilcoxon signed-rank test). Secondly, we explored to 
which extent a method-unique gene may be located, in 
a network, away from the known genes implicated in a 
disease family. Here we found that when PIN was used, 
the distribution of the shortest-path distances of ICN- 
unique cases is similar to that of PR-unique cases (Fig- 
ure 5B). Both ICN-unique and PR-unique cases are sig- 
nificantly more distant from known disease genes than 
that of RW-unique cases (P-values = 1.9x10-5 and 
2.6x10-5, respectively, Wilcoxon signed-rank test). The 
analysis of the method-unique cases using FAN yielded 
a similar result (Additional File 2). 

On the whole, these results support that a prioritiza- 
tion method may outperform the others when candi- 
date disease genes to be assessed have certain 
method-favored topological properties. When candi- 
date genes have more interacting partners in a net- 
work and are closer to other known disease genes, 
RW may perform better than the other methods. In 
contrast, ICN and PR may outperform RW when 
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A performance of prioritization methods tested on whole genome scans. The performance of different methods is assessed by 
(A) and FAN (B), respectively. 



prioritizing candidate genes that are more distant 
away from other known disease genes in a network. 
Therefore, it is quite possible that combining the 
ranking results of different methods may further 



improve the performance of candidate gene prioritiza- 
tion. In the next section, we show that a combined 
scoring strategy did improve the performance of prior- 
itizing candidate disease genes. 
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Figure 4 Venn diagram of successfully predicted cases among different prioritization methods. The cases which are successfully ranked 
the known disease genes as top 1 candidate are compared among ICN, RW, and PR by using PIN (A) and FAN (B), respectively. 
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Improving the performance using a combined scoring 
strategy 

Since each method may have its own favorite cases, we 
tried to improve the performance of prioritization by 
combining the results generated by different methods. 

To preserve the unique advantage of each method, we 
did not change any algorithmic approaches in them. 
Instead, we used a combined scoring strategy by multi- 
plying together the ranks generated by different methods 
(for details see section Materials and Methods). The 
performance of this new approach was also evaluated 
using the leave-one-out procedure under a test scenario 
of either simulated linkage analysis and whole-genome 
scan. 

Table 2 lists the performances of respective methods 
and different combined scoring schemes tested in the 
simulated linkage scenario. Here, we denote the scoring 
scheme of combining the ranking results of ICN and PR 
as the ICN-PR method, and so forth. Interestingly, all 
combined scoring schemes achieved higher success rates 
than respective methods. When PIN was used, the ICN- 
PR method showed the best performance (success rate 
48.9%). Besides, the ICN-RW method also showed a bet- 
ter success rate (46.9%) than respective methods. On the 
other hand, when FAN was used, the RW-PR method 
outperformed the other individual and combined meth- 
ods (success rate 73.7%). The ICN-PR method achieved 
a success rate (72. 7%) close to the best one. All the 
combined scoring schemes made substantial perfor- 
mance improvement compared to respective methods 



Table 2 Success rates of ranking known disease genes as 
the best candidate 



Success rate 


ICN 


RW 


PR 


ICN- 


ICN- 


RW- 


ICN-RW- 


(%) 








RW 


PR 


PR 


PR 


PIN 


43.7 


43.3 


43.4 


46.9 


48.9 


46.8 


44.5 


FAN 


64.1 


71.3 


66.4 


72.7 


73.3 


73.7 


72.4 



(ICN: 64.1%, RW: 71.3%, PR: 66.4%). Finally, when these 
combined scoring schemes were tested in the whole 
genome scan scenario, no performance improvement 
could be found (data not shown). It is not surprising 
since we expect that there could be missing parts in 
currently available biological networks and more genes 
are yet to be identified to fill in the networks. 

Here we further explored if the cases failed when 
respective methods were used could be recovered using 
the combined scoring schemes. The result is listed in 
Table 3. When PIN was used, 11 and 25 cases (out of 
911 cases failed using respective methods) could be 
recovered by the ICN-RW and the ICN-PR methods, 
respectively, but no cases could be recovered by the 
RW-PR method or the ICN-RW-PR method. We also 
tested if it could make a difference if FAN was used. It 
turned out that the ICN-RW method and the ICN-PR 
method rescued 27 and 22 cases (out of 543 cases failed 
using respective methods), respectively. The RW-PR 
method could rescue only one case, and the ICN-RW 
-PR method did not really show a much better perfor- 
mance (4 cases rescued). 
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Figure 5 Analysis of network topological properties on disease causing genes. The topological properties of disease genes in unique cases 
which were successfully ranked the known disease genes as top 1 candidate by a specific method in PIN (Figure 3A) were compared in degree 
(A) and average shortest-path distance between other disease-associated genes which are in the same disease family(B). 
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Table 3 Failed prediction cases recovered by combined 
methods 



# failed prediction cases & 


# cases re-ranked as top 1 candidate 




ICN-RW ICN-PR RW-PR ICN-RW-PR 


PIN 911 
FAN 543 


11 25 0 0 
27 22 1 4 



& a failed prediction case indicates that no prioritization method can rank the 
true disease gene as top 1 candidate. 



All in all, combining the results of different network- 
based methods indeed enhances the performance of 
prioritizing candidate disease genes. In particular, sub- 
stantial performance improvement was made when 
combining ICN with other methods. 

Using ICN and combined scoring schemes to find 
spinocerebellar ataxia genes 

To demonstrate the ability of ICN and the combined 
scoring schemes in finding novel disease genes, we pre- 
sent a case study for spinocerebellar ataxia type 22 
(SCA22) [39]. Autosomal dominant spinocerebellar atax- 
ias (SCAs) are a group of progressive neurodegenerative 
disorders characterized by the loss of balance and motor 
coordination due to dysfunction of the cerebellum [40]. 
SCAs are genetically heterogeneous. To date, more than 
30 genomic loci have been linked to different subtypes 
of SCA; however, only 18 causative genes have been 
determined [41,42]. Interestingly, these genes share 
common interacting partners [43], suggesting that net- 
work-based methods could be suitable for finding novel 
SCA-causing genes. SCA22 has been found to link to 
the locus on chromosome lq21-23 [39], where 541 pro- 
tein-coding genes were annotated (Ensembl release 58, 
http://www.ensembl.org). Our aim was to prioritize 
these 541 candidate disease genes. 

The confirmed SCA-causing genes in Table 4 were 
regarded as known disease genes for the SCA disease 
family. There were 15 and 17 of them in PIN and FAN, 
respectively. Table 5 and 6 present the top 10 candidate 
genes (i.e. k = 10) prioritized using PIN and FAN, 
respectively. Firstly, we tested individual methods. We 
noticed that ICN, RW, and PR generated very different 
results. No identical top one gene could be consistently 
determined by different methods. In addition, when PIN 
was used, only 2 genes, SPTA1 and GNAT2, were com- 
monly identified by all methods (k = 10, Table 5). Simi- 
larly when FAN was used, only 3 genes (KCNN3, 
SPTA1, and KCNC4) commonly identified by all meth- 
ods (k = 10, Table 6). 

Secondly, we tested combined scoring schemes and 
they appeared to generate more consistent results. When 
PIN and FAN were used respectively, there were corre- 
spondingly three (SPTA1, GNAT2, and NRAS) and 



Table 4 List of SCA-causing genes 


SCA subtype 


Gene 


PIN & 


FAN & 


SCA1 


ATXN1 


Y 


Y 


SCA2 


ATXN2 


Y 


Y 


SCA3 


ATXN3 


Y 


Y 


SCA5 


SPTBN2 


Y 


Y 


SCA6 


CACNA1 A 


Y 


Y 


SCA7 


ATXN7 


Y 


Y 


SCA8 


ATXN8 


N 


N 


SCA 10 


ATXN10 


Y 


Y 


SCA1 1 


^BK2 


Y 


Y 


SCA 12 


PPP2R2B 


Y 


Y 


SCA 13 


KCNC3 


N 


Y 


SCA 14 


PRKCG 


Y 


Y 


SCA 15 


ITPR1 


Y 


Y 


SCA 17 


TBP 


Y 


Y 


SCA27 


FGF14 


N 


Y 


SCA28 


AFG3L2 


Y 


Y 


SCA31 


PLEKHG4 


Y 


Y 


DRPLA 


ATN1 


Y 


Y 



& whether the disease genes are in the given network. DRPLA: dentatorubral- 
pallidoluysian atrophy 



seven (KCNN3, SPTA1, CCT3, KCNC4, KCNA2, 
KCND3, and KCNA3) common genes identified by all 
combined scoring schemes (k - 10, Table 5 and 6). 
Furthermore, SPTA1 and KCNN3 were consistently 
picked out as the best candidates by all combined scoring 
schemes using PIN and FAN, respectively. SPTA1 was 
also ranked in the top 3 candidate genes by combined 
scoring schemes when FAN was used. KCNN3 was not 
included in the candidate list when PIN was used because 
there was no interaction information for KCNN3. 

From protein function and literature survey, we found 
that SPTA1 and KCNN3 are very likely to associate 
with SCA22. SPTA1 is a member of spectrin family, 
functioning in actin crosslinking and as the molecular 
scaffold proteins to determine cell shapes and to arrange 
the transmembrane proteins. An in-frame deletion in 



Table 5 Top 10 candidate genes for SCA22 by using PIN 



Rank 


ICN 


RW 


PR 


ICN-RW 


ICN-PR 


RW-PR 


ICN-RW- 
PR 


1 


SPTA1 


NRAS 


YY1AP1 


SPTA1 


SPTA1 


SPTA1 


SPTA1 


2 


GNAT2 


SPTA1 


ECM1 


GNAT2 


YY1AP1 


AHCYL1 


GNAT2 


3 


TAF13 


GNAT2 


AHCYL1 


NRAS 


GNAT2 


NRAS 


NRAS 


4 


ISG20L2 


GNAI3 


FDPS 


ISG20L2 


TAF13 


ECM1 


YY1AP1 


5 


FCGR2C 


AHCY1 


SPTA1 


FCGR2C 


ECM1 


YY1AP1 


ECM1 


6 


YY1AP1 


STXBP3 


STXBP3 


TAF13 


NRAS 


GNAT2 


TAF13 


7 


PSMD4 


ECM1 


S100A7 


PSMD4 


STXBP3 


STXBP3 


STXBP3 


8 


NRAS 


CCT3 


POLR3C 


GNAI3 


PSMD4 


GNAI3 


AHCYL1 


9 


NGF 


RPS27 


UBAP2L 


NGF 


POLR3C 


S100A7 


GNAI3 


10 


NTRK1 


S100A7 


GNAT2 


NTRK1 


NGF 


FDPS 


PSMD4 
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Table 6 Top 10 candidate genes for SCA22 by using FAN 


Rank 


ICN 


RW 


PR 


ICN-RW 


ICN-PR 


RW-PR 


ICN-RW-PR 


1 


S100A6 


SPTA1 


KCNN3 


KCNN3 


KCNN3 


KCNN3 


KCNN3 


2 


KCNN3 


CCT3 


HCN3 


SPTA1 


S100A6 


SPTA1 


SPTA1 


3 


NGF 


KCNN3 


SPTA1 


S100A6 


SPTA1 


CCT3 


S100A6 


4 


KCNA2 


S100A11 


PPM1J 


KCNA2 
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SPTBN2, which is also a member of the spectrin family, 
can cause SCA5 [44]. Recent studies have shown that 
the mutant SPTBN2 disrupts fundamental intracellular 
transport processes in synapses [45-47]. This is likely to 
contribute to progressive neurodegenerative disease, 
such as SCA. Therefore, SPTA1 may cause SCA22 in a 
similar mechanism. Besides, KCNN3 is a member of the 
gene family encoding the small conductance calcium- 
activated potassium channels. A CAG repeat poly- 
morphism has been annotated in the amino-terminal 
coding region of KCNN3 [48]. Many studies revealed 
that such repeat polymorphisms associate with psychia- 
tric diseases, such as schizophrenia [49] and bipolar dis- 
eases [50]. 

To further validate these two candidates experimen- 
tally, an exome sequencing experiment was performed, 
and several novel gene variations have been found on 
SPTA1 in two SCA22 patients (Chung, M.-Y. et al., 
unpublished data). This preliminary result we present 
here suggests that ICN and the combined scoring 
schemes are able to identify the novel disease genes. 

Conclusions 

The InterConnectedNess-based method (ICN) is a bio- 
logically intuitive and parameter-free approach for prior- 
itizing candidate disease genes. There is no need for 
users to train the parameters every time when biological 
networks to be used are updated. ICN not only was 
comparable to other well-known methods, such the ran- 
dom walk method (RW) and the PRINCE algorithm 
(PR), but also outperformed these methods when candi- 
date disease genes are located more distantly to known 
disease genes in a network. Furthermore, combined 
ICN-RW or ICN-PR scoring schemes showed an 
impressive performance improvement in prioritizing 
candidate disease genes, suggesting that different net- 
work-based methods may complement the weakness of 
each other. 

In this study, we created a very simple combined scor- 
ing strategy by multiplying the ranks generated by 



different methods. The success of this strategy implies 
that there might still be a chance to further improve the 
performance of network-based methods in prioritizing 
candidate disease genes. To achieve this, we plan to try 
other strategies. In addition to combining method-speci- 
fic ranking results, combining network-specific ranking 
results appears to be another promising strategy. In fact, 
two algorithms, N-dimensional order statistics (NDOS) 
[51] and discounted rating system (DRS) [52], have been 
employed in some prioritization methods to combine 
ranking results generated respectively by using different 
network data sets. It would be interesting to find out if 
the performances of ICN or other network-based meth- 
ods can still be advanced when more heterogeneous 
approaches are integrated together. 

Materials and methods 

Biological networks 

Two kinds of biological networks were employed to 
test the performance of network-based methods in this 
study: protein-protein interaction network (PIN) and 
functional association network (FAN). PIN was con- 
structed by integrating protein-protein interaction data 
from nine databases, including DIP [26], BIND [27], 
IntAct [28], MIPS [29], MINT [30], HPRD [31], Bio- 
GRID [32], Reactome [33], and Pathway Commons 
[34]. Another dataset, FAN, was obtained from 
STRING v8.2, which was a comprehensive gene asso- 
ciation dataset containing directly physical interactions 
and functional links from experimental evidence and 
computational methods [35]. In both networks, the 
identifier for each gene was mapped to Entrez Gene 
ID, and self-interacting pairs were removed. Finally, 
PIN consists with 140,382 interactions and 12,164 
genes, and FAN consists of 1,217,908 interactions and 
16,648 genes (Table 1). Each connection in FAN was 
assigned a confidence scores from STRING, which 
reflects the confidence of each gene-gene association. 
PIN and FAN were regarded as unweighted and 
weighted networks, respectively. 
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Disease-gene associations 

The disease-gene associations were retrieved from the 
Morbid Map in OMIM [37]. If the causative genes were jq^ 
not included in the networks, their associations to dis- 
eases were removed. Because the prioritizing methods 
require related disease genes for prediction, the related 
causative genes were manually grouped into a disease 
family based on their given disorder name [53], and dis- 
ease families that have only one causative gene were fil- 
tered out. In total, 1,993 disease-gene associations 
implicated with 344 disease families were recruited in 
PIN and 2,616 disease-gene associations implicated with 
509 disease families were recruited in FAN (Table 1). 

Interconnectedness (ICN) between genes 

The closeness between genes in a network was quanti- 
fied by considering not only direct interaction of two 
genes but also the number of connectors between genes. 
As illustrated in Figure 6, the interconnectedness score 
ICN if j between two genes i and / was defined as: k { 



2 * C0 {] + 



MG(N f nN-) 



(1) 



where N is the neighboring genes of a given gene, and 
u is the gene linked to both gene i and /. is a weight 
of the connection between two genes, e.g. co i} j corre- 
sponds to the weight between gene i and In FAN, the 
value of co is within the interval between 0 and 1. In 
PIN, however, co is either 1 or 0, i.e. connected or 
unconnected. Because the number of connectors may be 
associated with the number of neighbors of each node, 
the number of connectors between two genes is normal- 
ized by the expected number of connectors between 
these genes. k t is the sum of weights of gene i's neigh- 
boring connections and is defined as: 



(2) 



U 



B 





H 



Figure 6 Illustration of interconnectedness between genes. This illustrates the interconnectedness (ICN) between gene /' and j. Each node 
represents a gene and each edge represents a either physical interaction or functional association, co is the weight of each connection, u is the 
set of connectors, which interact with both gene /' and j. 



Hsu et al. BMC Genomics 201 1, 12(Suppl 3):S25 
http://www.biomedcentral.com/1471-2164/12/S3/S25 



Page 1 0 of 1 2 



In an unweighted network, k t corresponds directly to 
the degree, namely the number of neighbors of a given 
gene [38]. 

Prioritizing candidate genes by interconnectedness scores 

Candidate genes are then prioritized based on the ICN 
scores calculated using equation 1. For a given disease 
d, each candidate gene was scored by summing up the 
closeness to the seed genes S d , i.e. the genes in the same 
disease family. The score of a given candidate gene i 
was calculated as: 

score i = — '— - "V ICN { f (o\ 

where ICN it j is the connection score between gene i 
and /. All candidate genes are then ranked based on 
these scores. 

Implement of random walk (RW) and PRINCE (PR) 
methods 

Both the random walk (RW) method [17] and the 
PRINE (PR) algorithm [18] apply an iterative procedure 
to find candidate disease genes in a network. When the 
difference between results of the previous and current 
steps (measured by Li-norm) fell below 10" 10 , the itera- 
tion was halted, and candidate genes were ranked based 
on the scores in the final step. 

The precise behaviors employed by the two methods 
to reach candidate genes in a network differ. RW [17] 
simulates a random walker that starts from one or a set 
of source nodes, and moves forward to neighboring 
nodes with a probability proportional to the weight of 
the connecting edge. RW also allows the walker to 
move back to the source node with probability r in each 
step, r controls how far the random walker could get 
away from the source node. PR [18], a propagation- 
based algorithm, exploits prior information on causative 
genes for the same disease or similar ones and infers a 
strength-of-association function to smooth over the net- 
work {i.e. adjacent nodes are assigned similar values). 
The parameter a in PR controls the relative importance 
of prior information. Using the tuning procedure 
described in [18], we set r = 0.5 and a = 0.9, which 
make corresponding methods achieve the optimal per- 
formance when the two network data sets described in 
this study are used. 

Experiment design and performance measurement 

Two test scenarios were designed to evaluate the perfor- 
mance of all methods: simulated linkage analyses and 
whole genome scan. In the simulated linkage analysis, a 
total of 100 genes flanking a test disease gene were 



taken as the candidate genes. In the whole genome scan, 
a test disease gene and all the genes in a biological net- 
work excluding other members from the corresponding 
disease gene family constitute the candidate gene list. 

A leave-one-out procedure is used to assess the per- 
formance of the different methods. In each trial, a dis- 
ease-gene association was removed and remaining genes 
in the same disease family were taken as seed genes to 
reconstruct the association. We used the "success rate" 
to represent the performance of a method. If the 
removed disease-gene association was ranked in top k of 
a candidate gene list, this trial was regarded as a suc- 
cessful prediction. The "success rate" of a method is 
defined as the fraction of successful predictions in all 
cases tested given a particular combination of a network 
data set and a test scenario. 

Combing the prioritization results given by different 
methods 

For each candidate gene i, a combined score CS t was 
calculated as: 

n 

i=i 

where R it j indicates the rank of gene i in method 
The candidate genes were re-ranked using the combined 
scores in an ascending order, i.e. the lower combined 
score, the higher priority. 

Additional material 



Additional file 1: List of related algorithms and tools for prioritizing 
disease candidate genes 

Additional file 2: Analysis of network topological properties on 
disease causing genes The topological properties of disease genes in 
unique cases which were successfully ranked the known disease genes 
as top 1 candidate by a specific method in FAN (Figure 3B) were 
compared in degree (A) and average shortest-path distance between 
other disease-associated genes which are in the same disease family(B). 
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